Abstract

For many types of learners one can compute the statistically “optimal” way to select data. We review how 
these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then 
show how the same principles may be used to select data for two alternative, statistically-based learning 
architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural 
networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted 
regression are both efficient and accurate. 
1 ACTIVE LEARNING -
BACKGROUND 

An active learning problem is one where the learner has 
the ability or need to influence or select its own training 
data. Many problems of great practical interest allow 
active learning, and many even require it. 

We consider the problem of actively learning a mapping
$X \to Y$ based on a set of training examples
$\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in X$ and $y_i \in Y$.
The learner is allowed to iteratively select new inputs
$\tilde{x}$ (possibly from a constrained set), observe the
resulting output $\tilde{y}$, and incorporate the new examples
$(\tilde{x}, \tilde{y})$ into its training set.

The primary question of active learning is how to
choose which $\tilde{x}$ to try next. There are many heuristics
for choosing $\tilde{x}$ based on intuition, including choosing
places where we don’t have data [Whitehead, 1991], where we
perform poorly [Linden and Weber, 1993], where we have
low confidence [Thrun and Moller, 1992], where we expect
it to change our model [Cohn et al., 1990], and
where we previously found data that resulted in learning
[Schmidhuber and Storck, 1993].
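
The query loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the noisy linear `noisy_line` environment, and the candidate grid are hypothetical, and the selection rule is a simple stand-in for the "query where we don't have data" style of heuristic, not any cited author's method.

```python
import random

def active_learning_loop(oracle, select_query, candidates, initial_data, n_queries):
    """Repeatedly choose an input, observe its output, and grow the training set."""
    data = list(initial_data)
    for _ in range(n_queries):
        x_new = select_query(data, candidates)  # choose which input to try next
        y_new = oracle(x_new)                   # observe the resulting output
        data.append((x_new, y_new))             # incorporate the new example
    return data

def farthest_from_data(data, candidates):
    """Placeholder heuristic: query the candidate farthest from any known input."""
    known = [x for x, _ in data]
    return max(candidates, key=lambda c: min(abs(c - x) for x in known))

def noisy_line(x):
    """Hypothetical environment: a noisy linear map standing in for the unknown mapping."""
    return 2.0 * x + random.gauss(0.0, 0.1)

random.seed(0)
grid = [i / 10 for i in range(11)]
history = active_learning_loop(noisy_line, farthest_from_data, grid, [(0.0, 0.0)], 3)
queried = [x for x, _ in history[1:]]
```

Any other selection rule can be swapped in by passing a different `select_query` function with the same two arguments.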

In this paper we consider how one may select $\tilde{x}$
“optimally” from a statistical viewpoint. We first review
how the statistical approach can be applied to neural
networks, as described in MacKay [1992] and Cohn
[1994]. We then consider two alternative, statistically-based
learning architectures: mixtures of Gaussians and
locally weighted regression. While optimal data selection
for a neural network is computationally expensive
and approximate, we find that optimal data selection for
the two statistical models is efficient and accurate.


2 ACTIVE LEARNING - A
STATISTICAL APPROACH

We denote the learner’s output given input $x$ as $\hat{y}(x)$.
The mean squared error of this output can be expressed
as the sum of the learner’s squared bias and its variance.
The variance $\sigma^2_{\hat{y}}(x)$ indicates the learner’s
uncertainty in its estimate at $x$.¹ Our goal will be to select
a new example $\tilde{x}$ such that when the resulting example
$(\tilde{x}, \tilde{y})$ is added to the training set, the
integrated variance IV is minimized:

$$ IV = \int \sigma^2_{\hat{y}} \, P(x) \, dx. \qquad (1) $$

Here, $P(x)$ is the (known) distribution over $X$. In practice,
we will compute a Monte Carlo approximation of
this integral, evaluating $\sigma^2_{\hat{y}}$ at a number of
random points drawn according to $P(x)$.

Selecting $\tilde{x}$ so as to minimize IV requires computing
$\tilde{\sigma}^2_{\hat{y}}$, the new variance at $x$ given
$(\tilde{x}, \tilde{y})$. Until we actually commit to an
$\tilde{x}$, we do not know what corresponding $\tilde{y}$ we
will see, so the minimization cannot be performed
deterministically.² Many learning architectures, however,
provide an estimate of $P(\tilde{y}|\tilde{x})$ based on
current data, so we can use this estimate to compute the
expectation of $\tilde{\sigma}^2_{\hat{y}}$. Selecting
$\tilde{x}$ to minimize the expected integrated variance
provides a solid statistical basis for choosing new examples.

2.1 EXAMPLE: ACTIVE LEARNING WITH
A NEURAL NETWORK

In this section we review the use of techniques from
Optimal Experiment Design (OED) to minimize the
estimated variance of a neural network [Fedorov, 1972;
MacKay, 1992; Cohn, 1994]. We will assume we have
been given a learner $\hat{y} = f_{\hat{w}}()$, a training set
$\{(x_i, y_i)\}_{i=1}^m$ and a parameter vector $\hat{w}$ that
maximizes a likelihood measure. One such measure is the
minimum sum squared residual

$$ S^2 = \frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}(x_i)\right)^2. $$

The estimated output variance of the network is

$$ \sigma^2_{\hat{y}} \approx S^2 \left(\frac{\partial \hat{y}(x)}{\partial w}\right)^T \left(\frac{\partial^2 S^2}{\partial w^2}\right)^{-1} \left(\frac{\partial \hat{y}(x)}{\partial w}\right). $$

The standard OED approach assumes normality and
local linearity. These assumptions allow replacing the
distribution $P(\tilde{y}|\tilde{x})$ by its estimated mean
$\hat{y}(\tilde{x})$ and variance $S^2$. The expected value of
the new variance, $\tilde{\sigma}^2_{\hat{y}}$, is then:

$$ \langle\tilde{\sigma}^2_{\hat{y}}\rangle \approx \sigma^2_{\hat{y}} - \frac{\sigma^2_{\hat{y}}(x, \tilde{x})}{S^2 + \sigma^2_{\hat{y}}(\tilde{x})} \quad \text{[MacKay, 1992]}, \qquad (2) $$

where we define

$$ \sigma_{\hat{y}}(x, \tilde{x}) \equiv S^2 \left(\frac{\partial \hat{y}(x)}{\partial w}\right)^T \left(\frac{\partial^2 S^2}{\partial w^2}\right)^{-1} \left(\frac{\partial \hat{y}(\tilde{x})}{\partial w}\right). $$

For empirical results on the predictive power of
Equation 2, see Cohn [1994].

The advantages of minimizing this criterion are that
it is grounded in statistics, and is optimal given the
assumptions. Furthermore, the criterion is continuous and
differentiable. As such, it is applicable in continuous
domains with continuous action spaces, and allows
hillclimbing to find the “best” $\tilde{x}$.

For neural networks, however, this approach has many
disadvantages. The criterion relies on simplifications
and strong assumptions which hold only approximately.
Computing the variance estimate requires inversion of a
$|w| \times |w|$ matrix for each new example, and
incorporating new examples into the network requires expensive
retraining. Paass and Kindermann [1995] discuss an
approach which addresses some of these problems.
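
To make the OED criterion concrete, here is a minimal Python sketch; it is not the procedure used in the paper. It assumes a model that is linear in its parameters (a hypothetical cubic-polynomial `features` map, so the gradient of the output with respect to the weights is just the feature vector), and it takes the parameter covariance in the usual least-squares form `s2 * inv(Phi.T @ Phi)`. The uniform input distribution, the sine target, and all names are illustrative assumptions.

```python
import numpy as np

def features(x):
    """Hypothetical feature map; for this linear model dy/dw equals features(x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

def fit(x, y):
    """Least-squares fit: returns weights, residual variance and inverse information matrix."""
    phi = features(x)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    s2 = np.mean((y - phi @ w) ** 2)
    a_inv = np.linalg.inv(phi.T @ phi)
    return w, s2, a_inv

def expected_new_variance(x_ref, x_query, s2, a_inv):
    """Expected variance at each reference point after adding each candidate query.

    Returns an array of shape (len(x_ref), len(x_query)).
    """
    g_ref = features(x_ref)
    g_q = features(x_query)
    var_ref = s2 * np.einsum("rp,pq,rq->r", g_ref, a_inv, g_ref)  # variance at x
    var_q = s2 * np.einsum("qp,pr,qr->q", g_q, a_inv, g_q)        # variance at the query
    cov = s2 * g_ref @ a_inv @ g_q.T                              # covariance of the two outputs
    return var_ref[:, None] - cov**2 / (s2 + var_q)[None, :]

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 0.5, size=12)          # data only on the left half
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(12)
w, s2, a_inv = fit(x_train, y_train)

x_ref = rng.uniform(0.0, 1.0, size=200)           # Monte Carlo sample from an assumed uniform P(x)
candidates = np.linspace(0.0, 1.0, 21)
expected_iv = expected_new_variance(x_ref, candidates, s2, a_inv).mean(axis=0)
best_query = candidates[np.argmin(expected_iv)]
```

Because this model is linear in its weights, the value returned for a candidate agrees with the variance obtained by actually adding that input to the design matrix while holding the residual variance fixed. For a neural network the same expression is only an approximation, and each evaluation needs the inverse of a matrix whose size is the number of weights.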


¹ Unless explicitly denoted, $\hat{y}$ and $\sigma^2_{\hat{y}}$ are functions of $x$.
For simplicity, we present our results in the univariate setting.
All results in the paper extend easily to the multivariate case.

² This contrasts with related work by Plutowski and White
[1993], which is concerned with filtering an existing data set.


3 MIXTURES OF GAUSSIANS


The mixture of Gaussians model is gaining popularity
among machine learning practitioners [Nowlan, 1991;
Specht, 1991; Ghahramani and Jordan, 1994]. It assumes
that the data is produced by a mixture of $N$ Gaussians
$g_i$, for $i = 1, ..., N$. We can use the EM algorithm
[Dempster et al., 1977] to find the best fit to the data,
after which the conditional expectations of the mixture
can be used for function approximation.

For each Gaussian $g_i$ we will denote the estimated
input/output means as $\mu_{x,i}$ and $\mu_{y,i}$ and estimated
covariances as $\sigma^2_{x,i}$, $\sigma^2_{y,i}$ and
$\sigma_{xy,i}$. The conditional variance of $y$ given $x$
may then be written

$$ \sigma^2_{y|x,i} = \sigma^2_{y,i} - \frac{\sigma^2_{xy,i}}{\sigma^2_{x,i}}. $$

We will denote as $n_i$ the (possibly fractional) number
of training examples for which $g_i$ takes responsibility:

$$ n_i = \sum_{j=1}^{m} \frac{P(x_j, y_j | i)}{\sum_{k=1}^{N} P(x_j, y_j | k)}. $$

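For an input $x$, each $g_i$ has conditional expectation
$\hat{y}_i$ and variance $\sigma^2_{\hat{y},i}$:

$$ \hat{y}_i = \mu_{y,i} + \frac{\sigma_{xy,i}}{\sigma^2_{x,i}} (x - \mu_{x,i}), \qquad \sigma^2_{\hat{y},i} = \frac{\sigma^2_{y|x,i}}{n_i} \left(1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}}\right). $$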


These expectations and variances are mixed according
to the prior probability that $g_i$ has of being responsible
for $x$:

$$ h_i \equiv h_i(x) = \frac{P(x | i)}{\sum_{j=1}^{N} P(x | j)}. $$

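For input $x$ then, the conditional expectation $\hat{y}$ of the
resulting mixture and its variance may be written:

$$ \hat{y} = \sum_{i=1}^{N} h_i \hat{y}_i, \qquad \sigma^2_{\hat{y}} = \sum_{i=1}^{N} \frac{h_i^2 \sigma^2_{y|x,i}}{n_i} \left(1 + \frac{(x - \mu_{x,i})^2}{\sigma^2_{x,i}}\right). $$

In contrast to the variance estimate computed for a neural
network, here $\sigma^2_{\hat{y}}$ can be computed efficiently
with no approximations.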
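
The mixture's prediction and variance can be computed directly from the fitted component statistics. The Python sketch below is illustrative only: it assumes a univariate input and output, Gaussian component densities that are normalized against each other with equal mixing weights when forming the responsibilities, and hypothetical argument names. The statistics would normally come from an EM fit, which is not shown.

```python
import numpy as np

def mixture_predict(x, mu_x, mu_y, var_x, var_y, cov_xy, n):
    """Conditional mean and variance of a mixture of Gaussians at a scalar input x.

    Each argument after x is a length-N array holding one statistic per component;
    n holds the (possibly fractional) number of examples each component is responsible for.
    """
    # Conditional variance of y given x within each component.
    var_y_given_x = var_y - cov_xy**2 / var_x

    # Responsibility of each component for x: normalized input densities.
    p_x = np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2.0 * np.pi * var_x)
    h = p_x / p_x.sum()

    # Per-component conditional expectation and the variance of that estimate.
    y_i = mu_y + cov_xy / var_x * (x - mu_x)
    var_i = var_y_given_x / n * (1.0 + (x - mu_x) ** 2 / var_x)

    # Mix the expectations with weights h and the variances with weights h squared.
    return np.sum(h * y_i), np.sum(h**2 * var_i)

# Two hypothetical components with opposite slopes, queried at their midpoint.
mu_x = np.array([0.0, 4.0])
mu_y = np.array([1.0, -1.0])
var_x = np.array([1.0, 1.0])
var_y = np.array([2.0, 2.0])
cov_xy = np.array([1.0, -1.0])
n = np.array([10.0, 10.0])
y_mid, var_mid = mixture_predict(2.0, mu_x, mu_y, var_x, var_y, cov_xy, n)
```

Every step is a handful of elementwise array operations, which reflects the point made in the text: no matrix inversion or linearization is involved.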

3.1 ACTIVE LEARNING WITH A 
MIXTURE OF GAUSSIANS 

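We want to select $\tilde{x}$ to minimize
$\langle\tilde{\sigma}^2_{\hat{y}}\rangle$. With a mixture of
Gaussians, the model’s estimated distribution of $\tilde{y}$
given $\tilde{x}$ is explicit:

$$ P(\tilde{y}|\tilde{x}) = \sum_{i=1}^{N} \tilde{h}_i P(\tilde{y}|\tilde{x}, i) = \sum_{i=1}^{N} \tilde{h}_i N\left(\hat{y}_i(\tilde{x}), \sigma^2_{y|x,i}(\tilde{x})\right), $$

where $\tilde{h}_i = h_i(\tilde{x})$. Given this, calculation of
$\langle\tilde{\sigma}^2_{\hat{y}}\rangle$ is straightforward:
we model the change in each $g_i$ separately, calculating its
expected variance given a new point sampled from
$P(\tilde{y}|\tilde{x}, i)$ and weight this change by
$\tilde{h}_i$. The new expectations combine to form the
learner’s new expected variance

$$ \langle\tilde{\sigma}^2_{\hat{y}}\rangle = \sum_{i=1}^{N} \frac{h_i^2 \langle\tilde{\sigma}^2_{y|x,i}\rangle}{n_i + \tilde{h}_i} \left(1 + \frac{(x - \tilde{\mu}_{x,i})^2}{\tilde{\sigma}^2_{x,i}}\right) \qquad (3) $$
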


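where the expectation can be computed exactly in closed
form:

$$
\begin{aligned}
\tilde{\mu}_{x,i} &= \frac{n_i \mu_{x,i} + \tilde{h}_i \tilde{x}}{n_i + \tilde{h}_i}, \\
\tilde{\sigma}^2_{x,i} &= \frac{n_i \sigma^2_{x,i}}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\tilde{x} - \mu_{x,i})^2}{(n_i + \tilde{h}_i)^2}, \\
\langle\tilde{\sigma}^2_{y,i}\rangle &= \frac{n_i \sigma^2_{y,i} + \tilde{h}_i \sigma^2_{y|x,i}(\tilde{x})}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\hat{y}_i(\tilde{x}) - \mu_{y,i})^2}{(n_i + \tilde{h}_i)^2}, \\
\langle\tilde{\sigma}_{xy,i}\rangle &= \frac{n_i \sigma_{xy,i}}{n_i + \tilde{h}_i} + \frac{n_i \tilde{h}_i (\tilde{x} - \mu_{x,i})(\hat{y}_i(\tilde{x}) - \mu_{y,i})}{(n_i + \tilde{h}_i)^2}, \\
\langle\tilde{\sigma}^2_{xy,i}\rangle &= \langle\tilde{\sigma}_{xy,i}\rangle^2 + \frac{n_i^2 \tilde{h}_i^2 \sigma^2_{y|x,i}(\tilde{x}) (\tilde{x} - \mu_{x,i})^2}{(n_i + \tilde{h}_i)^4}, \\
\langle\tilde{\sigma}^2_{y|x,i}\rangle &= \langle\tilde{\sigma}^2_{y,i}\rangle - \frac{\langle\tilde{\sigma}^2_{xy,i}\rangle}{\tilde{\sigma}^2_{x,i}}.
\end{aligned}
$$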
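
The per-component update can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function and argument names are hypothetical, the setting is univariate, and the expected-variance terms are written as weighted-moment updates for a single query `x_q` that the component takes responsibility `h_q` for. The full criterion would sum `h_i**2` times the second function over all components.

```python
import numpy as np

def expected_component_update(x_q, h_q, n, mu_x, mu_y, var_x, var_y, cov_xy):
    """Expected statistics of one Gaussian after adding a query at x_q with weight h_q."""
    var_y_given_x = var_y - cov_xy**2 / var_x
    y_q = mu_y + cov_xy / var_x * (x_q - mu_x)   # component's own prediction at x_q
    tot = n + h_q

    # The input statistics change deterministically, since x_q is chosen by the learner.
    new_mu_x = (n * mu_x + h_q * x_q) / tot
    new_var_x = n * var_x / tot + n * h_q * (x_q - mu_x) ** 2 / tot**2

    # The output statistics are expectations over the not-yet-observed output.
    exp_var_y = (n * var_y + h_q * var_y_given_x) / tot + n * h_q * (y_q - mu_y) ** 2 / tot**2
    exp_cov = n * cov_xy / tot + n * h_q * (x_q - mu_x) * (y_q - mu_y) / tot**2
    exp_cov_sq = exp_cov**2 + n**2 * h_q**2 * var_y_given_x * (x_q - mu_x) ** 2 / tot**4
    exp_var_y_given_x = exp_var_y - exp_cov_sq / new_var_x
    return new_mu_x, new_var_x, exp_var_y_given_x

def expected_component_variance(x, x_q, h_q, n, mu_x, mu_y, var_x, var_y, cov_xy):
    """One component's term of the expected new variance at x, before weighting by h(x)**2."""
    new_mu_x, new_var_x, exp_cond = expected_component_update(
        x_q, h_q, n, mu_x, mu_y, var_x, var_y, cov_xy
    )
    return exp_cond / (n + h_q) * (1.0 + (x - new_mu_x) ** 2 / new_var_x)

# Hypothetical component: 4 examples, query at x_q = 5 taken with full responsibility.
term = expected_component_variance(2.0, 5.0, 1.0, 4.0, 1.5, 0.0, 1.25, 1.0, 0.5)
```

A query with zero responsibility leaves the component's statistics unchanged, which is a convenient sanity check on any implementation of these updates.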


4 LOCALLY WEIGHTED 
REGRESSION 

We consider here two forms of locally weighted regression
(LWR): kernel regression and the LOESS model [Cleveland
et al., 1988]. Kernel regression computes $\hat{y}$ as an
average of the $y_i$ in the data set, weighted by a kernel
centered at $x$. The LOESS model performs a linear
regression on points in the data set, weighted by a kernel
centered at $x$. The kernel shape is a design parameter:
the original LOESS model uses a “tricubic” kernel; in
our experiments we use the more common Gaussian

$$ h_i(x) \equiv h(x - x_i) = \exp(-k (x - x_i)^2), $$

where $k$ is a smoothing constant. For brevity, we will
drop the argument $x$ for $h_i(x)$, and define $n = \sum_i h_i$.
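We can then write the estimated means and covariances as:

$$
\begin{aligned}
\mu_x &= \frac{\sum_i h_i x_i}{n}, &
\sigma^2_x &= \frac{\sum_i h_i (x_i - \mu_x)^2}{n}, &
\sigma_{xy} &= \frac{\sum_i h_i (x_i - \mu_x)(y_i - \mu_y)}{n}, \\
\mu_y &= \frac{\sum_i h_i y_i}{n}, &
\sigma^2_y &= \frac{\sum_i h_i (y_i - \mu_y)^2}{n}, &
\sigma^2_{y|x} &= \sigma^2_y - \frac{\sigma^2_{xy}}{\sigma^2_x}.
\end{aligned}
$$
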


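We use them to express the conditional expectations and
their estimated variances:

kernel:

$$ \hat{y} = \mu_y, \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_y}{n} $$

LOESS:

$$ \hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma^2_x} (x - \mu_x), \qquad \sigma^2_{\hat{y}} = \frac{\sigma^2_{y|x}}{n} \left(1 + \frac{(x - \mu_x)^2}{\sigma^2_x}\right) $$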
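
As an illustration of the two estimators, the following Python sketch computes the kernel-weighted statistics and the resulting predictions. It assumes the Gaussian kernel with smoothing constant `k`, a univariate input, and hypothetical function names; the noisy-sine data is made up for the example.

```python
import numpy as np

def local_stats(x, xs, ys, k):
    """Kernel-weighted means and (co)variances of the data around the query point x."""
    h = np.exp(-k * (x - xs) ** 2)        # Gaussian kernel weight of each training point
    n = h.sum()                           # effective number of local points
    mu_x = (h * xs).sum() / n
    mu_y = (h * ys).sum() / n
    var_x = (h * (xs - mu_x) ** 2).sum() / n
    var_y = (h * (ys - mu_y) ** 2).sum() / n
    cov_xy = (h * (xs - mu_x) * (ys - mu_y)).sum() / n
    return n, mu_x, mu_y, var_x, var_y, cov_xy

def kernel_predict(x, xs, ys, k):
    """Kernel regression: locally weighted average and its estimated variance."""
    n, _, mu_y, _, var_y, _ = local_stats(x, xs, ys, k)
    return mu_y, var_y / n

def loess_predict(x, xs, ys, k):
    """LOESS: locally weighted linear regression and its estimated variance."""
    n, mu_x, mu_y, var_x, var_y, cov_xy = local_stats(x, xs, ys, k)
    var_y_given_x = var_y - cov_xy**2 / var_x
    y_hat = mu_y + cov_xy / var_x * (x - mu_x)
    return y_hat, var_y_given_x / n * (1.0 + (x - mu_x) ** 2 / var_x)

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=40)
ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_normal(40)
y_kernel, var_kernel = kernel_predict(0.3, xs, ys, k=50.0)
y_loess, var_loess = loess_predict(0.3, xs, ys, k=50.0)
```

Both predictors share the same local statistics, so the only difference between them is whether the local slope term is used.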


4.1 ACTIVE LEARNING WITH LOCALLY 
WEIGHTED REGRESSION 


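Again we want to select $\tilde{x}$ to minimize
$\langle\tilde{\sigma}^2_{\hat{y}}\rangle$. With LWR,
the model’s estimated distribution of $\tilde{y}$ given
$\tilde{x}$ is explicit:

$$ P(\tilde{y}|\tilde{x}) = N\left(\hat{y}(\tilde{x}), \sigma^2_{y|x}(\tilde{x})\right) $$

The estimate of $\langle\tilde{\sigma}^2_{\hat{y}}\rangle$ is
also explicit. Defining $\tilde{h}$ as the weight assigned to
$\tilde{x}$ by the kernel, the learner’s expected
new variance is

kernel:

$$ \langle\tilde{\sigma}^2_{\hat{y}}\rangle = \frac{\langle\tilde{\sigma}^2_y\rangle}{n + \tilde{h}} $$

LOESS:

$$ \langle\tilde{\sigma}^2_{\hat{y}}\rangle = \frac{\langle\tilde{\sigma}^2_{y|x}\rangle}{n + \tilde{h}} \left(1 + \frac{(x - \tilde{\mu}_x)^2}{\tilde{\sigma}^2_x}\right) $$
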

where the expectation can be computed exactly in closed 
form: 
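
The LOESS criterion can be sketched in Python as below. This is a hedged illustration with hypothetical names, not the authors' implementation. It assumes that the expectations of the updated means and variances follow the same weighted-update form as the per-component expressions of Section 3.1, with $n$ and $\tilde{h}$ in place of $n_i$ and $\tilde{h}_i$, and that the model's own LOESS fit at the query supplies the mean `y_q` and variance `s2_q` of the unseen output. The noisy-sine data and uniform reference sample are made up.

```python
import numpy as np

def local_stats(x, xs, ys, k):
    """Kernel-weighted means and (co)variances of the data around x (Gaussian kernel)."""
    h = np.exp(-k * (x - xs) ** 2)
    n = h.sum()
    mu_x, mu_y = (h * xs).sum() / n, (h * ys).sum() / n
    var_x = (h * (xs - mu_x) ** 2).sum() / n
    var_y = (h * (ys - mu_y) ** 2).sum() / n
    cov_xy = (h * (xs - mu_x) * (ys - mu_y)).sum() / n
    return n, mu_x, mu_y, var_x, var_y, cov_xy

def loess_variance(x, xs, ys, k):
    """Current LOESS variance estimate at x."""
    n, mu_x, _, var_x, var_y, cov_xy = local_stats(x, xs, ys, k)
    return (var_y - cov_xy**2 / var_x) / n * (1.0 + (x - mu_x) ** 2 / var_x)

def expected_loess_variance(x, x_q, y_q, s2_q, xs, ys, k):
    """Expected LOESS variance at x after querying x_q, whose output is modelled as N(y_q, s2_q)."""
    n, mu_x, mu_y, var_x, var_y, cov_xy = local_stats(x, xs, ys, k)
    h_q = np.exp(-k * (x - x_q) ** 2)     # kernel weight the query would receive at x
    tot = n + h_q
    new_mu_x = (n * mu_x + h_q * x_q) / tot
    new_var_x = n * var_x / tot + n * h_q * (x_q - mu_x) ** 2 / tot**2
    exp_var_y = (n * var_y + h_q * s2_q) / tot + n * h_q * (y_q - mu_y) ** 2 / tot**2
    exp_cov = n * cov_xy / tot + n * h_q * (x_q - mu_x) * (y_q - mu_y) / tot**2
    exp_cov_sq = exp_cov**2 + n**2 * h_q**2 * s2_q * (x_q - mu_x) ** 2 / tot**4
    exp_cond = exp_var_y - exp_cov_sq / new_var_x
    return exp_cond / tot * (1.0 + (x - new_mu_x) ** 2 / new_var_x)

def score(x_q, x_ref, xs, ys, k):
    """Monte Carlo estimate of the expected integrated variance if x_q were queried."""
    _, mu_x, mu_y, var_x, var_y, cov_xy = local_stats(x_q, xs, ys, k)
    y_q = mu_y + cov_xy / var_x * (x_q - mu_x)
    s2_q = var_y - cov_xy**2 / var_x
    return np.mean([expected_loess_variance(x, x_q, y_q, s2_q, xs, ys, k) for x in x_ref])

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=30)
ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_normal(30)
k = 50.0
x_ref = rng.uniform(0.0, 1.0, size=50)    # reference points from an assumed uniform P(x)
candidates = np.linspace(0.0, 1.0, 21)
scores = np.array([score(c, x_ref, xs, ys, k) for c in candidates])
best_query = candidates[np.argmin(scores)]
```

Scoring a candidate needs only sums over the training set at each reference point, so no retraining or matrix inversion is involved. A query far from a reference point gets negligible kernel weight there and leaves the variance at that point unchanged.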
5 EXPERIMENTAL RESULTS

Below we describe two sets of experiments demonstrating
the predictive power of the query selection criteria in
this paper. In the first set, learners were trained on data
from a noisy sine wave. The criteria described in this paper
were applied to predict how a new training example
selected at point $\tilde{x}$ would decrease the learner’s
variance. These predictions, along with the actual changes in
variance when the training points were queried and added,
are plotted in Figures 1, 2, 3, and 4.

In the second set of experiments, we applied the techniques
of this paper to learning the kinematics of a two-joint
planar arm (Figure 5; see Cohn [1994] for details).
Below, we illustrate the problem using the LOESS algorithm.

An example of the correlation between predicted and
actual changes in variance on this problem is plotted in
Figure 6. Figures 7 and 8 demonstrate that this correlation
may be exploited to guide sequential query selection.
We compared a LOESS learner which selected
each new query so as to minimize expected variance with
LOESS learners which selected queries according to various
heuristics. The variance-minimizing learner significantly
outperforms the heuristics in terms of both variance
and MSE.

6 SUMMARY 

Mixtures of Gaussians and locally weighted regression
are two statistical models that offer elegant representations
and efficient learning algorithms. In this paper we
have shown that they also offer the opportunity to perform
active learning in an efficient and statistically correct
manner. The criteria derived here can be computed
cheaply and, for problems tested, demonstrate good
predictive power.


Figure 1: The upper portion of the plot indicates the
neural network’s fit to noisy sinusoidal data. The
lower portion of the plot indicates predicted and actual
changes in the network’s average estimated variance
when $\tilde{x}$ is queried and added to the training set, for
$\tilde{x} \in [0, 1]$. Changes are not plotted to scale with fits.


Figure 2: Fit to data and correlation for a mixture of
Gaussians.
Figure 7: Variance for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed
in this paper and according to several heuristics.
“Sensitivity” queries where output is most sensitive to new
data, “Bias” queries according to a bias-minimizing
criterion, “Support” queries where the model has the least
data support. The variances of “Random” and “Sensitivity”
are off the scale. Curves are medians over 15 runs
with non-Gaussian noise.
Figure 8: MSE for a LOESS learner selecting queries
according to the variance-minimizing criterion discussed 
in this paper and according to the heuristics described 
in the previous figure. 